The Annals of Applied Probability
2008, Vol. 18, No. 1, 1-21
DOI: 10.1214/07-AAP452
(c) Institute of Mathematical Statistics, 2008
APPROXIMATE WORD MATCHES BETWEEN TWO
RANDOM SEQUENCES
By Conrad J. Burden, Miriam R. Kantorovitz^
AND Susan R. Wilson
Australian National University, University of Illinois
and Australian National University
Given two sequences over a finite alphabet C, the D2 statistic
is the number of m-letter word matches between the two sequences.
This statistic is used in bioinformatics for expressed sequence tag
database searches. Here we study a generahzation of the D2 statistic
in the context of DNA sequences, under the assumption of strand
symmetric BernouUi text. For fc < m, we look at the count of m-letter
word matches with up to k mismatches. For this statistic, we compute
the expectation, give upper and lower bounds for the variance and
prove its distribution is asymptotically normal.
1. Introduction. Methods for alignment-free sequence comparison are
among the more recent tools being developed for sequence analysis in biol-
ogy [16]. A disadvantage in the classical Smith- Waterman local alignment
algorithm [13], as well as the popular search algorithms such as FASTA and
BLAST, is that they assume conservation of contiguity between homologous
segments. In particular, they overlook the occurrence of genetic shuffling
[18]. Alignment-free sequence comparison methods are used to compensate
for this problem.
A natural alignment-free comparison of two sequences is the number of
m-letter word matches between the sequences. This statistic, called D2,
can be computed in linear time in the length of the sequences, which is
also an advantage over the nonlinear local alignment algorithms. D2 is used
extensively for EST sequence database searches (e.g., [2, 3, 11] and in the
software package STACK [6]).
In [10], Lippert, Huang and Waterman started a rigorous study of D2
using the model of independent letters in DNA sequences. A formula for
Received May 2006; revised May 2007.
^Supported by Australian Research Council Discovery Grant DP0559260.
AMS 2000 subject classifications. 60F17, 92D20.
Key words and phrases. DNA sequences, sequence comparison, word matches.
This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in The Annals of Applied Probability,
2008, Vol. 18, No. 1, 1-21. This reprint differs from the original in pagination and
typographic detail.
1
2
C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
the expectation was computed as well as upper and lower bounds for the
variance. Limiting distributions, as the length of the sequences, n, and the
size of the word, m, get large, were derived in some cases. The authors used
Stein-Chen methods [5, 9, 14] to obtain the following results. When the
underlying distribution of the alphabet is nonuniform, the distribution of
D2 has normal asymptotic behavior when m/log^n < 1/6. The logarithmic
base b is defined by 6 = {J2a<^c Ca)~^) where is the probability of a letter
taking the value a. Following simulations, it was noted in [10] that the bound
1/6 above is too small. Our simulations in Section 6 suggest that the bound
should be closer to 2.
Another asymptotic regime was identified in [10] when m/log^n > 2. In
this case, the distribution of D2 has compound Poisson asymptotic behav-
ior. However, as pointed out in [17] and [1], the Poisson approximation is
meaningful in this region only when E{D2) is not too small. To control this
degenerate case, one needs to add the linear restriction m = 21og^n + C.
When the underlying distribution of the alphabet is uniform, it was proved
in [9] that for m = alogf^n + C with < a < 2, the distribution of D2 is also
asymptotically normal.
A natural generalization of the D2 statistic is to count the number of
(k)
approximate m-word matches. For k <m, let D2 be the number of m-word
matches with up to k mismatches between the two sequences. This statistic
can be expressed in terms of a distance function. One can define the distance
between two m-words to be the number of mismatches. A /c-neighborhood of
an m-word w is then all m-words that are at most k distance from w. The
(k)
D2 statistic is the number of A;-neighborhood matches of m-words between
two sequences.
In [12], Melko and Mushegian studied the A:-distance and fc-neighborhood
match count between a probe of length m and a random DNA sequence,
under the assumption that the sequence is strand-symmetric Bernoulli text.
They gave a formula for the expectation of the /c-distance match count
and the /c-neighborhood match count. Melko and Mushegian suggested that
methods of Lippert, Huang and Waterman in [10] could be used to obtain
(k)
upper and lower bounds for the variance of D2 and to analyze its asymp-
totic behavior.
(k)
In this paper we study the D2 statistic under the strand-symmetric
Bernoulli text assumption. We extend the method of [10] to give upper and
lower bounds for the variance. We analyze the asymptotic behavior of the
distribution of 1^2'^^ as n and m increase using the method of cumulants [8]
rather than Stein's method. For D2, the A; = case, this method improves
the bound on m/log^n obtained in [10] from 1/6 to 1/2.
The organization of this paper is as follows. In Section 2 we review def-
initions and introduce notation. In Section 3 we discuss the mean of
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES
3
Section 4 is devoted to the variance of -Dg ■ Section 5 we prove normal
(k)
asymptotic behavior of the distribution of D2 ■ Section 6 contains the results
of numerical simulations, and a concluding summary is given in Section 7.
A list of notations is provided at the end of Section 7.
2. Preliminaries. Let C = {A, G, C, T} with strand-symmetric probabil-
ity measure ^ = {(,a, £,Gi (,Ci Ct} and perturbation parameter r/. That is, — 1 <
rj <1 IS the unique number satisfying
Let A = A1A2 ■ ■ ■ An and B = B1B2 - ■ - be two random sequences of
length n over the alphabet C. We assume that A and B are Bernoulli
texts, meaning, the letters (nucleotides) are independent and identically dis-
tributed (i.i.d.). We note that the assumption of both sequences having the
same length is not essential for what follows and its main purpose is to
simplify notation. Our results can be easily adapted to the case when the
sequences are of different lengths.
Definition 2.1. Let x and y be two words of length m. We define the
distance between x and y to be
5(x, y) = number of character mismatches between x and y.
For k < m, we say that x is a k-distance match of y if 6{x,y) = k. When
^(x, y) < k, then x is said to be a k-neighbor of y.
Following the terminology and notation of [12], we have the following
definition.
Definition 2.2. In the above setup, define the perturbed binomial dis-
tribution with perturbation parameter rj by
9k{m, 7], c) = h{m, rj, c)uk{m, rj, c),
where < c,k < m are integers and
eT = i(i+r?),
h{m, rj, c
m—c
Uk{m,r],c)
m—k / \ /
i=0 ^ ^ ^
m — k — i
m — c
)
Vk{i,r],c)
4 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
For an m-word w with GC-count Cw, h{m, rj, Cw) is Pr(w), the probabihty of
seeing w. In the definition of Uk{m, rj, c), and in similar situations throughout
the paper, we fohow a general convention that (^) = if a < or a > n.
Note that when = 0, the perturbed binomial distribution is the binomial
distribution with gk{m,0,c) = bk{m, 1/4), where
As observed in [12], if T is a strand-symmetric Bernoulli text of length
m and q is a (known) query text (=word) of length m, then the probability
distribution of the distance (5(T,q) is a perturbed binomial distribution:
Pr(5(T, q) = fc) = c/fc(m, r], c),
where c is the GC-count in q and rj is the perturbation parameter of T. Let
k
Gk{m,r],c) =J29r{m,r],c) =Pr((5(T,q) <k)
r=0
be the cumulative distribution function of the distance.
2.1. k -neighborhood matches. Let A and B be two DNA sequences of
length n. Assume the sequences are strand-symmetric Bernoulli text with
perturbation parameter rj. Let 0<k<m<nhe integers.
Definition 2.3. Define the statistic 0^2^^ = D{k,m,n) to be the num-
ber of fc-neighborhood m-word matches between the sequences A and B,
including overlaps. Note that D2^^ is the D2 statistic of [10].
(k)
The Df' statistic may be computed as follows.
Notation. For 1 < s <t <n, write A[s, t] for the subsequence AgAg-^i . . .
At.
(k)
Definition 2.4. Let Y-^ be the ^-neighborhood match indicator (start-
ing) at position (position i in sequence A and j in B). That is,
y{k) ^ r 1, if d{A[i, i + m- l],B[j, j + m-l])<k,
\ 0, otherwise.
Then the d!^'' statistic can be computed via
(*,i)6/
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 5
where the index set / is
/ = GNxN:l<i<n-m + l, and l<j<n-m + l}.
For convenience, we write n for n — m + 1.
(k)
3. The mean of • In this section we give a general formula for the
(k)
mean of in terms of the perturbed binomial distribution and obtain
estimates for it. The estimates will be used in later sections in order to
(k)
prove normal asymptotic behavior of D2 ■
(k)
First we compute the mean of Y^j ':
^Kf ] = Pr(llf = 1)
= ^ Pr(5(A[i,z + m - l],vir) < A:)Pr(B[j, j + m - 1] = w)
= X! Gk{m,r],c^)Vi{Mv),
where Cw is the GC-count of w.
From this we get formulas for the expectation of D2 ■
71^
Pr(w)Gfc(?n,7?,Cw).
Remark 3.1. When k = 0, we have Gfc(m, 77, Cw) = Pr(w) and
^(0)1
■ij
This agrees with the formula given in Lippert, Huang and Waterman [10],
E[Yi,]=pl^\ where P2 = Eae£C^
Definition 3.2. For t > 1, let
Remark 3.3. Note that pt = E[{^xY~^]- Hence, by the Cauchy-Schwarz
inequality,
P3 >pI,
where equality holds if and only if the distribution is uniform: = 1/|£|.
6 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
3.1. Estimates. The purpose of these estimates is to explain the asymp-
totic behavior of -Dg*^^ , rather than to provide a computational tool. Hence,
these estimates are by no means optimal.
First we estimate the function gi^(^Tn,Tj, c-^) . Without loss of generality,
assume > 0. From Definition 2.2, we have upper bounds:
Vk{i,ri,c) <
Ukim,r],c) < —— J2
3 + r?\^
1 — r]
"J + ^ \ I ^\ I iTi — c \ ( ^ + ( m
I — rj J \i J \m — k — i J \1 — i] J \ k
Remembering that h{m,r],c-w) = Pr(w) and using similar estimates for the
lower bound we get
i, h;-— <5'fc("i,r/,Cw) <Pr(w) '
K J \1 +riJ \ K J \l — T]
and
Pr(w)E(7)(^)'<G.K,,c.)
(1)
< Pr(w) 2^ '
r=0
1 — 7]
Hence, for E[Y^f^] =J2^|=oy^PT{w)Gk{m,r],c^), we have
Finally, since E[d!^^] =J2fj^^E[Y^J'^], we have
Remark 3.4. For k = (the exact matches case), (3) gives E[d!^^] =
n^p™, which agrees with the expectation computed in [10].
Remark 3.5. When r] = (the uniform case), P2 = \ and the upper and
lower bounds in (3) are equal. Hence,
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES
7
k
r=0
k
3 \ ^' / 1 ^ Tn—r
r I V 4 j U,
r=0
4. The variance. In this section we give a lower and upper bound for
(k)
the variance oi D2 ■ The lower bound is used later to prove asymptotic
(k)
normality of D2 ■ The upper bound is not optimal, but is comparable with
that given in [10] for the A; = case. We start this section by stating the
main results in Propositions 4.1 and 4.2. We then prove several technical
lemmas and finish with the proofs of Propositions 4.1 and 4.2.
Proposition 4.1. Var(L»^ ^ >^ =
l{n, m, k), where
2k
■ n
(2n-4m + 2)(m-l)A;2^ ^ ^r'"^2m^P3
l + V
Proposition 4.2.
Var(D^ < ^ (2ra - 4m + 2)m^''' i^-
(4) -n2(2n-4m + 2
.Y3 + r/
2k
2pi
-7]
2P3
1 -P3
P3
P2
To compute the variance oi D2 = J2{i,j)GiYij ; we need to compute the
covariances CoY{Yff\Y},'^}).Fov this, we use techniques from [17]. To shorten
the indices' notation, let u = and v =
In the following definition we use notation and terminology from [17],
Chapter 11.
Definition 4.3. Let Ju = {v = -.{i' — i\ <m or \j' -j\ < m}. Then
Ju is the dependency neighborhood of Yu in the sense that if v ^ Ju, then
(k) (k)
Y^' and Y^'^' are independent. The dependency neighborhood can be de-
composed into two parts, accordion and crabgrass, Ju = where
Ju={v= eJu:\i'-i\<m and \j' -j\<m} and Jl = Ju\
8
C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
Let -u G /. When v i Ju, Cov{Y^ \Yr') = 0. To estimate Cov(yr\yrO
when V & Ju, we look at the two cases: v £ (crabgrass) and v £ (ac-
cordion). We will see that crabgrasses contribute the dominant term of
Var(£»f ^) in the cases we are interested in, that is, for m = O(logn). Hence
for accordions we only give a crude approximation of Cov{Yu^\Yv''^). We
start by proving the following positivity lemma.
Lemma 4.4. (i) For v G J^, Y^''^ and Y^''^ are nonnegatively correlated.
That is,
(ii) For V in the main diagonal of J^, Cow{YT,Yr')>Q.
Proof. We will use the following notation. For r > define
Y^p {r) = the fc-neighbor match indicator between two r- words at (i, j).
\i[r) = 5[A[i,i + r- 1], j + r - 1])
= number of mismatches in an r-word match at {i,j).
^2('") = (^(^[^' + m — r,i' + m — V\,B[j' + m — r,j' + m — 1])
= number of mismatches in an r-word match
at {i' + m — r,j' + m — r).
To prove part (i), let u= {i,j), v = {i' G J^. Write t = i' — i and s = j' — j.
By symmetry, we may assume v is in the first quadrant of J^, that is,
< t < m — 1, and m< s. We have
Cov(yW,yW)>o.
Pr(yW = i,yi'=) = i)
J2 Pr(Ai(t) = Zi)Pr(A2(t) = /2)
ll,l2=0
(5)
xPr(y(f;|^,))(m-t) = i,y(
{i'd')
{rn-t) = 1)
^ Pr(Ai(t) = /i)Pr(A2(t)=/2)
;i,«2=o
X ^ Pr(w)Gfc_i,(m-t,?7,Cw)Gfc_i2("^-*'^.Cw)
-wG£'"-*
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 9
J2 Pr(Ai(t) = li)Gk^i, (m - t, rj, Cw)
L h
For wG/:"^-*, let
(6) (w) = E Pr(A(t) = (m - t, r?, Cw) ,
I
where A(t) is the distance between two random t- words. Then (5) says that
i?[yWyW] = i<;[(/,(M^))2].
Similarly we get
E[Y:^'^^]E[Y,^'^] = E[ftiW)f.
Hence,
(7) Cov(yj'=), y«) = VariftiW)) > 0.
For part (ii), let u = and v = G J^'s main diagonal, that is,
V = {i + t,j + 1) with |t| < m — 1. By symmetry, we may assume <t. As
before, let Ai(t) be the number of mismatches in a t-word match at
and let A2{t) be the number of mismatches at (i + m,j + m). Then
£;[yWyW]= ^ Pr(Ai(t) =/i)Pr(A2(t) = ^2)
X Pr(yJ'=-'i)(m - t) = l,yj'=-'2)(m - t) = 1)
t
= E Pr(Ai(t)=/i)Pr(A2(t) = /2)
^1,^2=0
X Pr(yJ"^^"^*^-^i'*^-'2})(^_i) = 1)
and
E[yP]E[Y^^^]= E Pr(Ai(t) = /i)Pr(A2(t) = /2)
/l,«2=0
X Pr(yJ'=-'i)(7n -t) = l)Pr(yJ'^-'2)(m - t) = 1).
Since
p^(yJmin{fc-ii,fc-«2})(^ - t) = 1)
> Pr(yJ'=-'i)(m - t) = 1) Pr(yJ^-'2)(,„ -t) = l),
we have that Cov(yi''^ , yj''^ ) > 0. □
10 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
Remark 4.5. From (7) we have that for v G J^, Cov(yi''\yi^^) =
Var(/t(VF)) for appropriate t. When computing /t(w) in (6), it is worth
noting the following:
1. Pr(A(t)=0 = ExG£'Pr(x)gKi,r/,Cx)=Exe/:'Pr(x)2ni(t,r?,Cx).R:om Sec-
tion 3.1 we also have
Hence
2. For k — I >m — t and w G /Z"'"*, Gk~i{m — t, r], Cw) = 1-
Remark 4.6. When k = (exact matches case), and v € with t as
above, we get:
For wG/:™-*,
/t (w) = Pr(A(t) = 0)Go(m - t, r], Cw) = pI Pr(w).
Since E[Fv{W)] = Ewe£--' PKw)^ = p™"* and E[Pr(VF)2] =
Ewe£'"-* Pr(w)^ =^3*"*' we have that
Cov(yW,yW) = Var(A(VF)) =prVar(Pr(w)) =pr(Pr* "P?""*^)-
This agrees with the computations in [17], Section 11.5.2.
Remark 4.7. When rj = (uniform case), /t(w) does not depend on w
and hence Var(/f(l^)) = 0. Therefore, in this case, fort;G J^,Cov(yi'=\yi'=)):
0.
Next we look at the following special crabgrass case.
Lemma 4.8. Let u = and v = {i' G with t = \i' — i\ = m — 1
or (by symmetry) \j' — j\ = m — 1. Then
Cov(yW,yW) = [Pr(A(m - 1) = k)]\p,-pl),
where A(m — 1) is the distance between two random {m — 1) -words.
Hence, by Remark 4-5,
m-l\ fS-r]\'']^
P2
k J \l+r]
r(|-i)<Cov(yi'=),yW)
<
m — 1\ f 3 + r]
k J \l-r]
.1"(|-).
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 11
Proof. By (7),
Cov(yW,yJ^'))=Var(/i(Ty)).
Let w £ C. Then, using Remark 4.5,
/^(w) = ^ Pr(A(m - 1) = l)Gk^i{l, ri, c^)
1=0
k-l
= Pr(A(m - 1) = A;)Go(l, V, Cw) + Pr(A(m -l) = l)-l
1=0
k-l
= Pr(A(m - 1) = A;)^w + Pi'(A(m - 1) = 0-
/=o
Note that Pr(A(m - 1) = k) and J2'i=o Pr(A(m -l) = l) do not depend on
w. Hence,
Var(/t(VF)) = Var ^Pr(A(?n - 1) = k)Cw + J2 ^^(^("1 - 1) = 0^
= [Pr(A(m - 1) = k)fYaT{^w)-
As noted before (Remark 4.6, with m — t = I), Var(^vy) =P3 ~ P2- ^
For the accordion case, we use the fohowing crude estimate.
Lemma 4.9. For u, v G /, Cov(yi^\ yj''^) = 0{p^m'').
Proof.
(8)
|Cov(yj'=),yW)| < Vvar(yi'))Var(yi'^))
= Var(yW)<i?[(yW)^]=i?[yi^)]
<
r=0
f?Ec;j(^^) fro- (2
4.1. Proof of Proposition 4.1. We now prove the lower bound formula
for \ax{Df'^) stated in Proposition 4.1.
12
C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
Proof of Proposition 4.1. First we split the variance into the con-
tributions of crabgrasses and accordions:
varrf))= Y: Cov(yW,yW)
= E E Cov(yW,yW) + 5: y: Cov(yW,yj'^)).
Next we look at the crabgrasses:
> E E
U=(i,j)€l v={i' ,j')(iJ^
\i' —i\=m—l or \j' —j\=m—l
> E E
u=(i,j)ei «=(i'j')eJS
\i' —i\=m—l or \j'—j\=m—l
Cov(yi'=), yj'^)) by Lemma 4.4
1 + r/
P2
by Lemma 4.8
(2n -4m + 2)
k
l + V
2k
\P2
Finally we consider the contribution of the accordions to the variance:
E E Cov(yW,yi'^))
<EE lcov(yw,yi^))l
^ E E ofe
by Lemma 4.9
Then
Var(D5'=)) = E E Cov(yW,yJ'=)) +0(n2m'=+2p™)
>2n^ (2n-4m + 2)
m- 1\2 /3-77\2^ 2mYP3
1 + 7?
P2
□
4.2. Proof of Proposition 4.2. The first two terms in (4) come from crab-
grasses. Let u = {i,j) G I, V = {i',j') € with i' = i + t, <t <m — I. We
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 13
need to bound Cov(yi''\ fJ''^) = Var(/t(P^)):
E[ft{Wf]= P^(^)
< J2 Pr(w)=
'm—l
^FiiAit) = l)Gk^iim-t,ri,c^)
=0
71—1 k—l
=0
r=0
m — t\ fS + rj
r J [T~^
by (1)
(m-t)''
<P3
{m-ty
'm—l "1 2
EPr(A(t)=/)
E Pr(w)=
/3 + 7?V
fel 2
and similarly,
m—l
E Pr(w) E Pr(A(i) = ^^^-/(m - t, r?, Cw)
>
>
m — l fc— i
5: Pr(w)2^Pr(A(i)=/)E
n 2
m-t\ f^-ri
r ) [l + r]
m — l
Y: Pr(w)2^Pr(A(t)=0
Lwe£™-t /=0
2
E
m-ti2
Hence,
\1 — 7] J
Summing up over all n's and f 's and using
J2 E = ^^(2^ - 4m + 2)
2g
1 -g"
1-Q
14 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
2k
with q = P3 and p^j respectively, yields
2k
3 + 7]
1 — r]
2P3
1 -P3
P3
-n^(2n-4m + 2)
The last term in (4) comes from accordions:
^ ^ Cov(yW,yJ'=))<5: 5: |Cov(yW,yW)|
■ n
'(2m-l)2p-^
r=0
3 + 7?
1 — r]
3 + r?
1 — r]
from (8)
5. Asymptotic behavior. We will need the following central limit theo-
rem of Janson [8] for certain sums of dependent random variables. To state
it, we first recall the definition of dependency graph.
Definition 5.1. A graph T is a dependency graph for a family of ran-
dom variables if the following holds:
1. There is a one-to-one correspondence between the random variables and
the vertices of the graph.
2. If Vi and V2 are two disjoint sets of vertices of F such that there is no
edge (fi,f2) in F with vi G Vi and V2 € V2, then the corresponding sets
of random variables are independent.
Also recall that the maximal degree of a graph is the maximal number of
edges attached to a single vertex.
Theorem 5.2 ([8], Theorem 2). Suppose that for each n, {Wni}f=i is
a family of hounded random variables; \Wni\ < C„ almost surely. Suppose
further that F„ is a dependency graph for this family and let M„ he the
maximal degree of F„ [if F„ has no edges, set Mn = !)• Let Sn = Y^f=i Wni
and = Var(S'„). If there exists an integer t such that
(9)
then
{Nn/Mnf''MnCn/cTn ^0 aS
n—>-oo,
{Sn- E{Sn))/ an ^M{0,1)
as n — oo.
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 15
Next we state and prove our main theorem.
Theorem 5.3. Assume that the four letters of the alphabet C are not
uniformly distributed. That is, the perturbation parameter r] is not zero. Let
fin = E[d!^^] and Un = V^Varpf ^).
For m = a log^^^^^ i^) + C with < a < 1/2 and C a constant, and fixed k
such that < k < m,
D
(k)
M{0, 1) as oo.
Proof. We apply Theorem 5.2 to the match indicator random variables
(k) 9
y> ^ In this case, the dependency graph has n vertices and edges may be
defined by connecting the vertex {i,j) with (i', j') if |z' — i| < m or \j' — j\ <
m. Hence, in the notation of Theorem 5.2, A'^^ = n^; C„ = 1; the maximal
degree of r„ is the size of a dependency neighborhood:
\Ju\ = (2m - l)(2n - 2m + 1);
(fc)
and Sn = D.^
Let m = a\ogiip^{n) + C with < a (and k fixed). Then for a < 1, the
lower bound, i, for cr^ in Proposition 4.1, has the property
(2n -Am + 2)
7]
2k
P2
2m P3
= Cm^"^" + 0(n2-"(log(n))'=+2
~ 7^3-2a since a < 1.
Therefore, in condition (9) we have
iNjMn)^/'MnCn
1
+ 0(nV=+2p^^
where Ci > is a constant
(nV((2m - l)(2n - 2m + l)))^/*(2m - l)(2n - 2m + 1)
<
n
(2m-l)(2n-2m + l)
<7n
1/t
(2m-l)(2n-2m + l)
(10)
n
(2n - 4m + 2)
3 — 7]
l + T]
2k
,2m. I P3_
pI
1
P2
+ 0(n2m'=+V2")
l/2s
16 C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
(c.logi/p^(n))i-iAni+V* ^ (alogi/p^(n))i'i/*
~ [n3-2«]l/2 „l/2-l/t-a
^0 asn^oo, if 1/2 - 1/t - a > 0.
Thus, for a < 1/2, we can find t large enough such that 1/2 — 1/t — a > 0.
□
In [10], Lippert, Huang and Waterman used a variation on Stein's resuh
([15], page 110), due to Dembo and Rinott ([7], Theorem 4.2), to prove
the fohowing result, under the assumption of i.i.d. letters, for the D2 =
statistic. Let C be an alphabet set of size \C\ > 1 with nonuniform probability
measure ^. Then for m = alogi/p2{n) + C with < a < 1/6,
=>J\[0,1) asn^oo.
Following simulations, it was noted in [10] that the bound 1/6 above is too
small. Our simulations in Section 6 suggest that the bound should be closer
to 2.
By adjusting the proof of Theorem 5.3 to an alphabet set of any size
\C\ > 1, we can improve the bound on a from 1/6 to 1/2. Thus we have the
following theorem.
Theorem 5.4. Let L he an alphabet set of size \C\ > 1 with nonuniform
probability measure Then for m = alogi/p2{n) + C with <a < 1/2,
D2{n)- fin d .
=^Jv[0,l) as n—> 00.
Proof. From the lower bound for D2 in [10],
yai:{D2) > n
(2m - l)(2n - 4m + 2)p^™(p3/p^ - 1)
+ p^fl±Il^_^2m-l)p-
V 1 -P2
~ alogi/p^{n){n^ ^") since (ps/p^ — 1) > by Remark 3.3.
Hence, in (10) in the proof of Theorem 5.3 we now have
(iV„./M„)V*M„a
CTn
_ (nV((2m - l)(2n - 2m + l)))^/\2m - l){2n - 2m + 1)
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 17
^2 ^ 1/i
<(- -— (2m-l)(2n-2m + l)
(2m-l)(2n-2m + l
X f n
(2m - l)(2n - 4m + 2)p^™ ( ^ - 1
(alogi/p^(n))i-V*ni+V* ^ (alogi/^^ (n)) V^'V*
[alogi/p^(n)n3-2a]i/2 nVs-i/t-a
^0 as n ^ cx), if 1/2 - 1/t - a > 0.
The rest of the proof is the same as the proof of Theorem 5.3. □
Remark 5.5. When the underlying distribution is uniform, Yar{D^^) ~
■ Hence, this method of proof fails to show normal asymptotic behavior
in the uniform case. In fact, for A; = 0, the distribution of 02'^ is not normal
when |£| = 2, m = 1 and n — > 00 (see [10]).
6. Numerical simulations. We have carried out numerical simulations
of pairs of randomly generated sequences of length n = 100 x 2*, i = 0, . . . , 4
with the nonuniform letter distribution iA = = \-, ic = = h- The statis-
tic D2 was calculated for each sequence pair using an algorithm based on
that given in [12]. Kolmogorov-Smirnov values [4] for the standardized
statistic {D^"^ — fin)/crn compared with the standard normal distribution
for sample sizes of 2500 sequence pairs are shown in Figure 1. Samples of
(k)
D2 which are a close approximation to the normal distribution will have
p- values distributed uniformly on the interval [0, 1], whereas samples which
are a poor approximation to the normal distribution will have small p- values.
Entries in the tables in Figure 1 are shaded to indicate proximity of sam-
ples to the standard normal distribution, with lighter shades signaling a
better agreement. The white diagonal line in each table is m = 21ogi/p2 ^ ~^
const. The numerical evidence strongly suggests that
— =^N{0,1) as n — i- 00,
where the limit is taken along any line m = alogi^p^{n) + C with < a < 2
for fixed k and C.
7. Conclusions. We have studied the 02^^ statistic as suggested in [12],
and defined it as the number of m-word matches with up to k mismatches
18
C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
Fig. 1. Kolmogorov-Smirnov p-values for nonuniform d'"2^ with letter distribution
=Ct = |, = Cg = I compared with a normal distribution. The white diagonal lines
are m = alog^/p^n+ const., witha = 2 and '^/P2 = l/J2ae{A,c,G,T}^"-^T'-
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 19
between two sequences of length n, for strand-symmetric Bernoulli texts
with a nonuniform letter distribution. We have extended methods applied
in [10] to give upper and lower bounds on the variance, and have also studied
the asymptotic behavior as the sequence length n and word length m tend
to infinity for fixed k.
(k)
We have proved that the asymptotic distributional behavior of D2 is
normal as n ^ 00 for a pair of strand-symmetric Bernoulli texts provided
the limit is taken along any line m = alogi/p^ n C with < a < ^. For
/c = this result is also shown to hold for a pair of Bernoulli texts with any
nonuniform letter distribution. This improves the previous bound for the
A; = case of a < g given in [10].
We have also carried out numerical simulations of strand-symmetric texts
with letter distribution S,a = = \ , ic = = \ - These simulations strongly
suggest that the optimum restriction on asymptotic normal behavior may be
as high as a < 2. This is consistent with simulations in [10] and their result
that the asymptotic distributional behavior of D^^ is a compound Poisson
distribution for a>2.
List of notations.
D2: The number of m-letter word matches between two given sequences.
D2 ■ The number of m-letter word matches with up to k mismatches (0 <
k <m) between two given sequences.
gk{m,r],c): The perturbed binomial distribution (Definition 2.2). Given a
strand-symmetric Bernoulli text of length m and perturbation parameter
r], and an m-word with GC-content c, this is the probability distribution
of the number of character mismatches between the text and the m-word.
Gfc(m, rj, c): The cumulative distribution function of the perturbed binomial
distribution gk{m,r],c).
hk{Tn,rj^c): For a given m-word with GC-content c, the probability that the
word occurs at a given site in a strand-symmetric Bernoulli string with
perturbation parameter 77. (See Definition 2.2.)
Ju'- The dependency neighborhood of Yu^^ , where u= that is, the word
locations v = {i' such that either the word at i' overlaps the word at i
in the first sequence, or the word at j' overlaps the word at j in the first
sequence, or both. (See Definition 4.3.)
J^: The accordion, that is, the subset of Ju such that both the word at i'
overlaps the word at i in the first sequence, and the word at j' overlaps
the word at j in the first sequence.
J^: The crabgrass, that is, the subset of Ju such that either the word at i'
overlaps the word at i in the first sequence, or the word at j' overlaps the
word at j in the first sequence, but not both.
20
C. J. BURDEN, M. R. KANTOROVITZ AND S. R. WILSON
k: The number of mismatches.
m: The word length.
n: The length of each of the two sequences.
n: n — m + 1, the number of possible locations of an m-word in a sequence
of length n.
Pt- "llaec^a^ where the sum is taken over the alphabet C.
Mfc(m, 77, c), ry, c): Functions occurring in the definition of gk{m,r],c).
(See Definition 2.2.)
Yu''^ or y}-'^ : The indicator random variable for the event that the ?Ti-word
starting at position i in the first sequence has no more than k mismatches
with the m-word starting at position j in the second sequence. We use
the convention u= {i,j),v = throughout.
77: The perturbation parameter for a strand-symmetric Bernoulli text. (See
Section 2.)
S^a- The probability of finding the letter a at a given location in a strand-
symmetric Bernoulli string.
REFERENCES
[1] Barbour, A. and Chryssaphinou, O. (2001). Compound Poisson approximation:
A user's guide. Ann. Appl. Probab. 11 964-1002. MR1865030
[2] Burke, J., Davison, D. and Hide, W. (1999). d2 cluster: A validated method for
clustering EST and full-length cDNA sequences. Genome Res. 9 1135-1142.
[3] Carpenter, J. E., Christoffels, A., Weinbach, Y. and Hide, W. A. (2002).
Assessment of the parallelization approach of d2 cluster for high-performance
sequence clustering. J. Comput. Chem. 23 755-757.
[4] CONOVER, W. J. (1999). Practical Nonparametric Statistics. Wiley, New York.
[5] Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Probab. 3
534-545. MR0428387
[6] Christoffels, A., van Gelder, A., Greyling, G., Miller, R., Hide, T. and
Hide, W. (2001). STACK: Sequence tag alignment and consensus knowledge-
base. Nucleic Acids Res. 29 234-238.
[7] Dembo, a. and Rinott, Y. (1996). Some examples of normal approximations by
Stein's method. IMA Vol. Math. Appl. 76 25-44. MR1395606
[8] Janson, S. (1988). Normal convergence by higher semiinvariants with applications
to sums of dependent random variables and random graphs. Ann. Probab. 16
305-312. MR0920273
[9] Kantorovitz, M. R., Booth, H. S., Burden, C. J. and Wilson, S. R. (2007).
Asymptotic behavior of fc-word matches between two random sequences. J. Appl.
Probab. 44 788-805.
[10] Lippert, R. a., Huang, H. and Waterman, M. S. (2002). Distributional regimes
for the number of fc-word matches between two random sequences. Proc. Natl.
Acad. Sci. USA 99 13980-13989. MR1944413
[11] Miller, R. T., Christoffels, A. G., Gopalakrishnan, C, Burke, J., Ptitsyn,
A. A., Broveak, T. R. and Hide, W. A. (1999). A comprehensive approach to
clustering of expressed human gene sequence: The sequence tag alignment and
consensus knowledge base. Genome Res. 9 1143-1155.
APPROXIMATE WORD MATCHES BETWEEN RANDOM SEQUENCES 21
[12] Melko, O. M. and Mushegian, A. R. (2004). Distribution of words witli a prede-
fined range of mismatches to a DNA probe in bacterial genomes. Bioinformatics
20 67-74.
[13] Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular
subsequences. J. Mol. Biol. 147 195-197.
[14] Stein, C. (1972). A bound for the error in the normal approximation to the distribu-
tion of a sum of dependent random variables. Proceedings of the Sixth Berkeley
Symposium on Mathematical Statistics and Probability 2 583-602. Univ. Cali-
fornia Press, Berkeley, CA. MR0402873
[15] Stein, C. (1986). Approximate Computation of Expectations. IMS, Hayward, CA.
MR0882007
[16] ViNGA, S. and Almeida, J. S. (2003). Alignment-free sequence comparison — a re-
view. Bioinformatics 19 513-523.
[17] Waterman, M. S. (2000). Introduction to Computational Biology. Chapman and
Hall/CRC, New York.
[18] Zhang, Y. X., Perry, K., Vinci, V. A., Powell, K., Stemmer, W. P. and del
Cardayre, S. B. (2002). Genome shuffling leads to rapid phenotypic improve-
ment in bacteria. Nature 415 644-646.
C. J. Burden
.John Curtin School of Medical Research
AND Mathematical Sciences Institute
Australian National University
Canberra, A.C.T. 0200
Australia
E-MAIL: conrad.burden@anu.cdu.au
M. R. Kantorovitz
Department of Mathematics
University of Illinois
Urbana, Illinois 61801
USA
E-MAIL; ruth@math.uiuc.edu
S. R. Wilson
Mathematical Sciences Institute
Australian National University
Canberra, A.C.T. 0200
Australia
E-MAIL: sue.wiIson@anu.edu.au